Statistical Grand Slams

A Bayesian Approach to Modeling Home Run Production in Major League Baseball

Sam Turner

Introduction

  • In this research project, Dr. Parson and I sought to predict the home run (HR) production of Major League Baseball hitters.

  • If you work for an MLB team, predicting HR’s is important because a HR is the best possible outcome of an at-bat (AB)

  • And even if you don’t work for an MLB team, you could make a lot of money in Vegas if you can accurately predict HR’s!

Introduction Cont.

  • A problem with predicting player HR’s is that there is often limited data available

    • For example, a player with 6 seasons is considered a veteran but contributes only 6 player-season data points
  • Generalized linear models (GLM’s) from classical statistics have a hard time fitting accurate models to so few data points, such as the 6 data point scenario mentioned above

  • This is where a Bayesian model shines: a multilevel Bayesian model can “learn” from the other players in the data, increasing the effective number of observations available for fitting each player

Fellingham and Fisher (2017)

Pros:

  • Bayesian model

Fellingham and Fisher (2017) Cont.

Cons:

  • Uses multilevel modeling sparingly

    • Only applies multilevel modeling to their orthogonal quartic polynomial
  • Priors are uninformative and don’t fit data

    • Expect average player’s HR probability to be 0.00015

    • Priors effectively suggest that players could have a HR probability between 0 and 1

  • Strange choices of parameters

    • Age, decade of birth, season of play, home ballpark

Model

\[ \begin{align} HR_{nip} &\sim Binomial(AB_{nip},\pi_{nip})\\ \log\left(\frac{\pi_{nip}}{1-\pi_{nip}}\right) &= \alpha_{n}+\beta_n\cdot (Age_{ni}-30)+\eta_n\cdot (Age_{ni}-30)^2\\ &+\delta_p+\xi_i \end{align} \]

The number of HR’s hit by player \(n\) in year \(i\) at park \(p\) is binomially distributed, with the player’s AB’s that year at that park as the number of trials and \(\pi_{nip}\) as the HR probability per AB.

The number of AB’s in a year is taken as given when predicting HR’s, while the probability that an AB results in a HR, \(\pi\), varies according to a number of factors.
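
As a rough sketch of how the pieces of the linear predictor combine, the R snippet below turns one set of coefficients into a HR probability and an expected HR count. It assumes the link is the logit (consistent with the later “-3.5 on the logit scale” remark). Every coefficient value, the age, and the AB total are made up for illustration; none of them are fitted estimates from the model.

```r
# Hypothetical coefficient values, chosen only for illustration
alpha <- -3.5    # player intercept (logit scale)
beta  <- 0.01    # linear age effect
eta   <- -0.005  # quadratic age effect
delta <- 0.05    # park effect
xi    <- 0.10    # year effect

age <- 27        # hypothetical player age
ab  <- 500       # at-bats in the season (treated as given)

# Linear predictor on the logit scale
logit_pi <- alpha + beta * (age - 30) + eta * (age - 30)^2 + delta + xi

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
pi_hat <- plogis(logit_pi)

# Expected HR count under the binomial model: AB * pi
expected_hr <- ab * pi_hat

print(c(logit_pi = logit_pi, pi_hat = pi_hat, expected_hr = expected_hr))
```

Because every term enters additively on the logit scale, a change in any one effect shifts the log-odds by the same amount regardless of the others.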

Model Cont.

We can both simulate and predict the number of HR’s of a player. For example, let’s examine a player who has 100 AB’s and a probability \(\pi\) of 0.03.

That is, \(HR \sim Binomial(100,0.03)\).

Simulation:

set.seed(1)
rbinom(5, 100, 0.03)
[1] 2 2 3 5 2

Prediction:

\(E(HR)=AB\cdot \pi=100\cdot0.03=3\)
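
To connect the simulation and the prediction, a short sketch in R can compare the average of many simulated seasons with \(AB\cdot\pi\), and ask the same binomial model for the probability of specific outcomes. The number of simulated seasons (100,000) is an arbitrary choice for illustration.

```r
set.seed(1)
ab   <- 100
prob <- 0.03

# Simulate many seasons for the same player
sims <- rbinom(100000, ab, prob)

# The simulated mean should be close to AB * pi
sim_mean <- mean(sims)
expected <- ab * prob

# The simulated spread should be close to sqrt(AB * pi * (1 - pi))
sim_sd    <- sd(sims)
theory_sd <- sqrt(ab * prob * (1 - prob))

# Probabilities of specific outcomes under the same model
p_zero_hr    <- dbinom(0, ab, prob)      # no HR at all
p_five_plus  <- 1 - pbinom(4, ab, prob)  # 5 or more HR

print(c(sim_mean = sim_mean, expected = expected,
        sim_sd = sim_sd, theory_sd = theory_sd))
print(c(p_zero_hr = p_zero_hr, p_five_plus = p_five_plus))
```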

Data

Our data for the model come from the Lahman data set. Lahman covers a variety of information on each player, including, but not limited to, batting, fielding, team, and player statistics.

To be considered in our analysis, a player must have accumulated at least 6 seasons of play between 1973 and 2019, with at least 50 AB’s in each.

“Innate” HR Hitting Ability - \(\alpha_n\)

\[ \begin{align} \alpha_n&\sim Normal(\mu_0, \sigma_0), n\in\{1,...,657\}\\ \\ \mu_0&\sim Normal(-3.5,0.1)\\ \sigma_0&\sim Exponential(1) \end{align} \]

  • Intercept term which represents “innate” hitting ability of players skilled enough to play in MLB

  • -3.5 on the logit scale is ≈0.029 or 2.9%

  • -3.5±0.1 on the logit scale is [-3.6, -3.4], i.e. [0.0266, 0.0323] or [2.7%, 3.2%] in probability

  • -3.5±2 on the logit scale is [-5.5, -1.5], i.e. [0.0041, 0.182] or [0.4%, 18.2%] in probability
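
The conversions above can be checked with R’s built-in inverse logit, plogis(). The sketch below also draws from the prior to see what HR probabilities it implies. It assumes that the second argument of Normal is a standard deviation and that the Exponential parameter is a rate; the slides do not state the parameterization, so treat both as assumptions.

```r
# plogis() maps the logit scale back to probabilities
center <- plogis(-3.5)
narrow <- plogis(c(-3.6, -3.4))
wide   <- plogis(c(-5.5, -1.5))

print(round(center, 4))
print(round(narrow, 4))
print(round(wide, 4))

# Prior predictive draws for alpha_n
# (assumed: Normal(mean, sd) and Exponential(rate))
set.seed(1)
n_draws <- 10000
mu0     <- rnorm(n_draws, -3.5, 0.1)
sigma0  <- rexp(n_draws, 1)
alpha   <- rnorm(n_draws, mu0, sigma0)

# Implied HR probability per AB under the prior
prior_pi <- plogis(alpha)
print(quantile(prior_pi, c(0.05, 0.5, 0.95)))
```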

Multilevel Trade-Off - No Pooling

No pooling would mean that each player, \(n\), gets their own \(\alpha\) estimate. That is,

\[ \log\left(\frac{\pi_n}{1-\pi_n}\right)= \alpha_{PLAYER[n]}+... \]

This intercept would just match the data we had on each player. It probably over-fits because it relies entirely on each player’s own data, which is sparse. You can think of this model as suffering from amnesia, because it assumes that players have nothing in common with one another.

Multilevel Trade-Off - Total Pooling

In a total pooling scenario, all players share the same intercept \(\alpha\), which is just the overall mean HR probability expressed on the logit scale. There are lots of data, so we can be confident of the value of \(\alpha\); however, it wouldn’t be effective to apply on a case-by-case basis. This is because few players are likely to have \(\alpha\)’s which exactly match the mean of the data.

\[ \log\left(\frac{\pi_n}{1-\pi_n}\right)=\alpha+...=\text{logit}(\overline{HR})+... \]

You can think of this model suffering from over-sharing since it cannot distinguish the \(\alpha\) from player to player.

In essence, we lose information from both no and total pooling.
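
A toy comparison in R can make the trade-off concrete. The HR and AB totals below are invented, and the “partial pooling” line is only a crude stand-in for what a multilevel model does: it shrinks each player’s rate toward the shared rate using a made-up weight (prior_ab), not a fitted hyperparameter.

```r
# Toy data: hypothetical HR and AB totals for four players
hr <- c(2, 30, 12, 0)
ab <- c(60, 550, 400, 55)

# No pooling: each player's rate comes only from their own data
no_pool <- hr / ab

# Total pooling: one shared rate for everyone
total_pool <- sum(hr) / sum(ab)

# Rough partial pooling: shrink each rate toward the shared rate,
# more strongly when the player has few AB's.
# prior_ab is a made-up "pseudo at-bats" weight, not a fitted value.
prior_ab     <- 200
partial_pool <- (hr + prior_ab * total_pool) / (ab + prior_ab)

print(round(cbind(ab, no_pool, total_pool, partial_pool), 4))
```

In this sketch the player with 0 HR in 55 AB’s is no longer assigned a HR probability of exactly 0, while the player with 550 AB’s moves only slightly from their own rate.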

“Innate” HR Hitting Ability - \(\alpha_n\) (Again)

\[ \begin{align} \alpha_n&\sim Normal(\mu_0, \sigma_0), n\in\{1,...,657\}\\ \\ \mu_0&\sim Normal(-3.5,0.1)\\ \sigma_0&\sim Exponential(1) \end{align} \]

  • Intercept term which represents “innate” hitting ability of players skilled enough to play in MLB

  • -3.5 on the logit scale is ≈0.029 or 2.9%

  • -3.5±0.1 on the logit scale is [-3.6, -3.4], i.e. [0.0266, 0.0323] or [2.7%, 3.2%] in probability

  • -3.5±2 on the logit scale is [-5.5, -1.5], i.e. [0.0041, 0.182] or [0.4%, 18.2%] in probability

HR Hitting by Age

HR Distribution

HR Distribution Cont.

Centered Age Effect - \(\beta_n\)

\[ \begin{align} \beta_n &\sim Normal(\mu_1,\sigma_1), n\in\{1,...,657\}\\ \\ \mu_1 &\sim Normal(0,0.1)\\ \sigma_1 &\sim Exponential(10) \end{align} \]

  • Coefficient that multiplies centered age (Age - 30), representing how deviation from age 30 affects HR hitting ability

  • Age is a factor in hitting HR’s, but its effect is likely not very large, so we have the priors set near 0 to reflect this

Centered Age Effect Squared - \(\eta_n\)

\[ \begin{align} \eta_n &\sim Normal(\mu_2, \sigma_2),n\in\{1,...,657\}\\ \\ \mu_2 &\sim Normal(0, 0.01)\\ \sigma_2 &\sim Exponential(100)\\ \end{align} \]

  • Coefficient that multiplies centered age squared [(Age - 30)²], representing how the squared deviation from age 30 affects HR hitting ability

  • Used to capture the non-linearity of the data; the tight priors near 0 limit the risk of over-fitting
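
To see what the linear and squared age terms do together, the R sketch below traces an age curve for one hypothetical player. The values of alpha, beta, and eta are invented for illustration and are not estimates from the fitted model. With a negative eta the curve is an inverted parabola on the logit scale, peaking at age \(30-\beta/(2\eta)\).

```r
# Hypothetical coefficients for a single player (not fitted values)
alpha <- -3.5
beta  <- 0.016
eta   <- -0.004

ages <- 20:40

# Quadratic age curve on the logit scale, centered at age 30
logit_pi <- alpha + beta * (ages - 30) + eta * (ages - 30)^2

# HR probability per AB at each age
pi_age <- plogis(logit_pi)

# Vertex of the parabola: the age at which HR probability peaks
peak_age <- 30 - beta / (2 * eta)

# Integer age with the highest HR probability on the grid
best_age <- ages[which.max(pi_age)]

print(round(data.frame(age = ages, pi = pi_age), 4))
print(c(peak_age = peak_age, best_age = best_age))
```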

HR Hitting by Age

Park Effect - \(\delta_p\)

\[ \begin{align} \delta_p &\sim Normal(\mu_5,\sigma_5),p\in\{1,...,88\}\\ \\ \mu_5 &\sim Normal(0,0.01)\\ \sigma_5 &\sim Exponential(10) \end{align} \]

  • Intercept term which captures the effect that playing in different parks has on HR probabilities

  • Parks differ in both dimensions and altitude, both of which affect HR rates
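
Because \(\delta_p\) is added on the logit scale, \(e^{\delta_p}\) acts as a multiplier on the HR odds. The R sketch below illustrates the size of such an effect for a typical player, using three invented park effects; the values of delta and the 500 AB season are assumptions for illustration only.

```r
# Typical player on the logit scale (prior mean of alpha)
base_logit <- -3.5

# Hypothetical park effects, not fitted values
delta <- c(pitcher_park = -0.10, neutral = 0, hitter_park = 0.10)

# Multiplicative change in HR odds relative to a neutral park
odds_ratio <- exp(delta)

# HR probability per AB in each park
pi_park <- plogis(base_logit + delta)

# Expected HR's over a hypothetical 500 AB season
hr_per_500 <- 500 * pi_park

print(round(rbind(odds_ratio, pi_park, hr_per_500), 4))
```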

Park Overlay

Park Overlay Cont.

Year Effect - \(\xi_i\)

\[ \begin{align} \xi_i &\sim Normal(\mu_6,\sigma_6),i\in\{1,...,47\}\\ \\ \mu_6 &\sim Normal(0,0.25)\\ \sigma_6 &\sim Exponential(10) \end{align} \]

  • Intercept term which captures the effect that playing in different years has on HR probability

  • Changes can occur because of rules, ownership goals, player goals, etc.

  • This term captures those changes without asking why there are changes

HR’s by Year in MLB

HR Proportion by Year in MLB

Model (Again I)

\[ \begin{align} HR_{nip} &\sim Binomial(AB_{nip},\pi_{nip})\\ \log\left(\frac{\pi_{nip}}{1-\pi_{nip}}\right) &= \alpha_{n}+\beta_n\cdot (Age_{ni}-30)+\eta_n\cdot (Age_{ni}-30)^2\\ &+\delta_p+\xi_i \end{align} \]

  • There are 2,116 parameters for this model

    • (3×657)+88+47+10
  • This model is classically non-identifiable because of its 3 intercept terms (\(\alpha_n\), \(\delta_p\), \(\xi_i\))!
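
Both points can be checked with a few lines of R. The parameter count uses the group sizes from the priors (657 players, 88 parks, 47 years) plus the 10 hyperparameters. The second half sketches why three intercepts are not separately identifiable from the likelihood alone: moving a constant from one intercept to another leaves \(\pi\) unchanged. The coefficient values and the shift are hypothetical.

```r
# Parameter count: three per-player terms, park and year effects,
# and five (mu, sigma) hyperparameter pairs
n_players <- 657
n_parks   <- 88
n_years   <- 47
n_hyper   <- 10
n_params  <- 3 * n_players + n_parks + n_years + n_hyper
print(n_params)

# Non-identifiability sketch with hypothetical values:
# shift a constant from the park intercept to the player intercept
alpha <- -3.5
delta <- 0.05
xi    <- 0.10
shift <- 0.7

pi_original <- plogis(alpha + delta + xi)
pi_shifted  <- plogis((alpha + shift) + (delta - shift) + xi)

# Same probability, so the data alone cannot tell the two apart;
# the priors are what anchor each intercept
print(all.equal(pi_original, pi_shifted))
```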

Model (Again II)

  • The model predicts that the average player’s \(\pi\) is about 0.03 (on the probability scale), which is what we observe in the data

  • Uses Bayesian techniques to update estimates based on what the data says - allows for inference

Robin Yount - Actual \(\pi\)

Robin Yount - Predicted \(\pi\)

Pat Tabler - Actual \(\pi\)

Pat Tabler - Predicted \(\pi\)

Kendrys Morales - Actual \(\pi\)

Kendrys Morales - Predicted \(\pi\)

Goodness of Fit - Trace-plots

Conclusions

  • Will our model make the Hood math department excellent gamblers?

    • No. But it did a fine job of adapting its predictions to individual players and being “right” on average
  • Areas of future research

    • Player archetypes and physical characteristics

    • Considering more players over a longer time interval (like Fellingham and Fisher (2017))

    • Better data (advanced metrics or play-by-plays)

Conclusions Cont.

  • We believe we did better than Fellingham and Fisher (2017), but without access to their model and computational resources, this is difficult to verify